Conversation

@camille-bouvy-frequenz (Contributor) commented Nov 21, 2024

This PR introduces a backoff module that wraps function calls with a configurable timeout, which increases with every timed-out call. It is then used for all non-streaming gRPC-related methods of the Client. The default timeout is set to 300 seconds (i.e. 5 min), with flexibility to adjust as needed.
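A minimal sketch of such a wrapper, assuming plain asyncio (the timeout-growth logic between retries is omitted here for brevity; only the basic timeout behaviour is shown):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def grpc_call_with_timeout(
    call: Callable[..., Awaitable[Any]],
    *args: Any,
    timeout: float = 300.0,  # default of 300 s (5 min), as described above
    **kwargs: Any,
) -> Any:
    """Run a non-streaming gRPC call, cancelling it if the deadline passes."""
    # asyncio.wait_for raises asyncio.TimeoutError on expiry, so the caller
    # gets an error instead of a call that hangs indefinitely.
    return await asyncio.wait_for(call(*args, **kwargs), timeout=timeout)
```

A timed-out call then surfaces as `asyncio.TimeoutError` rather than blocking the caller.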

@camille-bouvy-frequenz camille-bouvy-frequenz requested a review from a team as a code owner November 21, 2024 14:09
@github-actions github-actions bot added the part:docs Affects the documentation label Nov 21, 2024
@camille-bouvy-frequenz (Contributor Author)

This timeout is now increased dynamically with each failed connection attempt.

@tiyash-basu-frequenz left a comment

It looks like the grpc_call_with_timeout is invoked per call. Maybe I am wrong, but wasn't the idea to implement the backoff at the client level, and not at a call level?

If we indeed need to implement it at a client level, then we could send the commands via a channel to a central client task, that can impose a global backoff logic. It would also need logic to reset the backoff after a certain period passes since the last successful RPC call.
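The channel-based suggestion above might look something like this hypothetical sketch (names are illustrative, not an existing API; for brevity the backoff resets immediately on success rather than after a quiet period since the last success):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class CentralClientTask:
    """Funnel all RPCs through one task so backoff state is global, not per call."""

    def __init__(self, initial_delay: float = 1.0, max_delay: float = 60.0):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._initial_delay = initial_delay
        self._max_delay = max_delay
        self._delay = 0.0  # current global backoff; 0 means no backoff

    async def submit(self, call: Callable[[], Awaitable[Any]]) -> Any:
        """Send a command to the central task and wait for its result."""
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((call, future))
        return await future

    async def run(self) -> None:
        """Process queued RPCs sequentially, imposing a global backoff."""
        while True:
            call, future = await self._queue.get()
            if self._delay > 0.0:
                await asyncio.sleep(self._delay)  # every caller waits it out
            try:
                result = await call()
            except Exception as exc:  # sketch only: real code would be selective
                # Double the backoff on failure, capped at max_delay.
                self._delay = min(
                    max(self._initial_delay, self._delay * 2), self._max_delay
                )
                future.set_exception(exc)
            else:
                self._delay = 0.0  # success: clear the backoff
                future.set_result(result)
```

Because every call passes through `run()`, a fresh call cannot bypass a backoff triggered by an earlier failed one.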

@camille-bouvy-frequenz (Contributor Author)

Hmm, fair point that the timeout could be increased at the client level. That said, some gRPCs inherently take much longer than others (e.g., list_gridpool_orders), which is why I was thinking of having this handled at the gRPC level. However, I’m definitely not opposed to managing it at the client level either.

For me, the most important thing is that we have a timeout in place (whether it’s dynamically decided or fixed) to avoid hanging calls.

@tiyash-basu-frequenz

I see. True, different RPCs having different timeouts needs to be considered.

The main objective of timeouts would be to reduce stress on the client's connection. Having per-call timeouts violates this principle: one call might be backing off while a new call goes straight through. So my guess is that while you may see some benefits of this approach right away, it would be a matter of chance when the connection to the server gets stressed. The more client apps there are, the worse this will get.

@camille-bouvy-frequenz force-pushed the add-timout-requests branch 2 times, most recently from ff2e5ff to a0eb107 on November 28, 2024 16:43
@camille-bouvy-frequenz (Contributor Author)

I currently have a max_timeout (maximum timeout duration for gRPC calls), a timeout_increment (value added to the timeout on each retry, e.g. 20 s), and a max_timeout_retries (maximum number of retry attempts when a timeout is reached). Having those 3 vars could be a bit of overkill; I could remove one of them if needed.
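An illustrative sketch of how these three settings could interact (the parameter names follow the comment above; the starting timeout `initial_timeout` and the loop structure are assumptions, not the actual implementation):

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def call_with_backoff(
    call: Callable[[], Awaitable[Any]],
    *,
    initial_timeout: float = 60.0,     # assumed starting value
    timeout_increment: float = 20.0,   # added after each timed-out attempt
    max_timeout: float = 300.0,        # cap on the per-attempt timeout
    max_timeout_retries: int = 3,      # attempts before giving up
) -> Any:
    timeout = initial_timeout
    for attempt in range(max_timeout_retries):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == max_timeout_retries - 1:
                raise  # out of retries: surface the timeout to the caller
            # Grow the timeout for the next attempt, but never past the cap.
            timeout = min(timeout + timeout_increment, max_timeout)
```

With both a cap and an increment, the number of useful retries is arguably implied, which is one way to drop one of the three settings.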

@tiyash-basu-frequenz left a comment

The addition itself looks good. I have a few improvement suggestions.

@camille-bouvy-frequenz camille-bouvy-frequenz changed the title Add timout to gRPC requests Add timeout to gRPC requests Dec 2, 2024
@tiyash-basu-frequenz left a comment

Looks very good! I would just suggest adding a few tests to nail down stability and prevent regressions, and then this is done.

@github-actions github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests label Dec 4, 2024
@shsms (Contributor) commented Dec 9, 2024

Sorry, I'm just realizing this has become a retry implementation for non-streaming calls.

Retries of this sort have to be infinite because we have to keep trying until the API call succeeds, and doing that in the client would take away the actor's agency in deciding what to do when there is a client error, like taking fallback action, sending a slack message, etc.

It also introduces the risk of creating orders with outdated power or price forecasts if the client's retry succeeds after a few minutes.

For this reason, all of our clients should have retries only for streaming APIs (which are taken care of by the GrpcStreamBroadcaster in the base client).

For non-streaming APIs, retries can be implemented in the actors using the ExponentialBackoff provided by the base client. I'm also adding a receiver interface for this in the base client, allowing us to use a select loop to decide when to retry.

Also, I've implemented these in the actor for now in this PR: https://github.com/frequenz-io/frequenz-actor-electricity-trading/pull/249

So, I think this PR should go back to just implementing a timeout for non-streaming methods to prevent them from blocking for too long and limiting the actor's ability to take fallback action.

@tiyash-basu-frequenz

Then all that needs to be done is invoke only simple RPCs with the backoff, and call the streaming ones directly. If a backoff module exists in the base client, then that could be used directly (and sorry that you implemented it here @camille-bouvy-frequenz).

I also would not recommend going back to a non-global backoff, since I do not see a value in that.

@shsms (Contributor) commented Dec 10, 2024

> Then all that needs to be done is invoke only simple RPCs with the backoff

Not in the client, for the reasons mentioned above. We just need timeout to be implemented in this PR.

> I also would not recommend going back to a non-global backoff, since I do not see a value in that.

Top-level retry is provided by the actor implementation, which gives the most flexibility. There are also plans to add additional backoff mechanisms, failure detection based on run times, etc., and then it is needed only in one place. If users need additional control, they can do it in their actors.

@camille-bouvy-frequenz force-pushed the add-timout-requests branch 2 times, most recently from 32305d4 to f5cdbe7 on December 10, 2024 13:34


async def grpc_call_with_timeout(
call: Callable[..., Awaitable[Any]], *args: Any, timeout: float = 300, **kwargs: Any
@shsms (Contributor) commented Dec 10, 2024
I think this should take a timedelta | None, and the default value should be None, here and in the other calls. Otherwise it wouldn't be intuitive, and it wouldn't be what you'd expect from a thin wrapper.

@camille-bouvy-frequenz (Contributor Author)

I agree about the timedelta, but I don't think the timeout should be None by default. Otherwise we risk the API getting stuck again (without raising any errors) like it did last week.

@shsms (Contributor)

Well, the actor would set it when making calls to the client. The actor can get the value to use from the config file.

The 300s comes from specific issues we noticed in the server and what we assume the actor needs. We can't say that 300 is the right value in the long term. It could become different even as soon as we go to production.

@camille-bouvy-frequenz (Contributor Author)

No, the 300 s was arbitrary, just to have some timeout on the API requests. And the issues didn't come from the actor's side but from the API side, when it disconnected (without throwing any errors). I don't mind changing the 300 s to another value, but I'd feel more comfortable having a timeout by default.

@shsms (Contributor)

> No the 300s was arbitrary just to have some timeout in the API requests.

This is just the thing. Defaults have to make sense, and we can't find a value that does, especially when we expect external parties to use it. Maybe we shouldn't add this here at all and keep it only in the actors? If, at some point, we get an SDK-like layer between the client and the actor, these problems would become much easier.

Maybe @llucax has an opinion? Because we try to be consistent with all the API clients.

In the worst case, we go ahead with 300s as the default if it works for most cases.

For example, the default receiver buffer size of 50 is a bit arbitrary and has bitten me sometimes. It didn't occur to me for a long time that it was overflowing and needed a bigger buffer. But still, having 50 as a default has been very useful in most places. The situation here is very different because None as buffer size would have meant no buffer, whereas here None as timeout means infinite timeout.

@camille-bouvy-frequenz (Contributor Author)

Yeah, OK, I get your point. I'll set the default to None for now, and just make sure we call the functions with timeouts in the actors.

@tiyash-basu-frequenz

> > Then all that needs to be done is invoke only simple RPCs with the backoff
>
> Not in the client, for the reasons mentioned above. We just need timeout to be implemented in this PR.
>
> > I also would not recommend going back to a non-global backoff, since I do not see a value in that.
>
> Top level retry is provided by the actor implementation. That gives the most flexibility. And there are plans to add additional backoff mechanisms and failure detection based on run times, etc. And then it is needed only in one place. If users need additional control, they can do it in their actors.

So if I understand it right, you specifically are suggesting having per-RPC backoffs in this client? If yes, how would that help?

Nice that a global backoff module exists in the actor library!

@shsms (Contributor) commented Dec 10, 2024

> you specifically are suggesting having per-RPC backoffs in this client?

No, I'm saying we shouldn't add automatic retry in the client for non-streaming RPCs. There is no question of backoff if there's no retry.

I'm only suggesting that we have timeout for such RPCs.

Signed-off-by: camille-bouvy-frequenz <[email protected]>
@shsms previously approved these changes Dec 11, 2024
@shsms (Contributor) left a comment

Just an optional comment.

Signed-off-by: camille-bouvy-frequenz <[email protected]>
@camille-bouvy-frequenz camille-bouvy-frequenz added this pull request to the merge queue Dec 11, 2024
Merged via the queue into frequenz-floss:v0.x.x with commit b8c5439 Dec 11, 2024
14 checks passed
@camille-bouvy-frequenz camille-bouvy-frequenz deleted the add-timout-requests branch December 11, 2024 14:10